AI Bootcamp
Data science is an interdisciplinary field that combines:
Goal: Extract knowledge and insights, and discover hidden patterns, in structured and unstructured data collected from various sources (web, smartphones, sensors, customers, etc.).
Data science or data-driven science enables better decision making, predictive analysis, and pattern discovery:
In data science and big data you will come across many different types of data, each of which tends to require different tools and techniques. The main categories of data are:
Introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
Provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
Link: http://www.numpy.org/
A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
NumPy arrays are preferred over lists and tuples for their efficiency, especially when working with large datasets.
import numpy as np
# 1D Array
array_1d = np.array([1, 2, 3, 4, 5])
print("1D Array:", array_1d)
print("Shape:", array_1d.shape)
# 2D Array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", array_2d)
print("Shape:", array_2d.shape)
# 3D Array
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print("3D Array:\n", array_3d)
print("Shape:", array_3d.shape)
NumPy provides many functions to create arrays:
import numpy as np
a = np.zeros((2,2)) # Create an array of all zeros
print(a) # Prints "[[ 0. 0.]
# [ 0. 0.]]"
b = np.ones((1,2)) # Create an array of all ones
print(b) # Prints "[[ 1. 1.]]"
c = np.full((2,2), 7) # Create a constant array (integer fill value gives an int array)
print(c) # Prints "[[7 7]
# [7 7]]"
d = np.eye(2) # Create a 2x2 identity matrix
print(d) # Prints "[[ 1. 0.]
# [ 0. 1.]]"
e = np.random.random((2,2)) # Create an array filled with random values
print(e) # Might print "[[ 0.91940167 0.08143941]
# [ 0.68744134 0.87236687]]"
Example 1: 1D Array Indexing
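The code for this example appears to have been lost in conversion; a minimal sketch in the same style as Examples 2 and 3 below:

```python
import numpy as np
# Create a 1D NumPy array
arr = np.array([10, 20, 30, 40, 50])
# Access elements by index (zero-based)
arr[0]   # 10 (first element)
arr[2]   # 30 (third element)
arr[-1]  # 50 (last element)
arr[-2]  # 40 (second-to-last element)
```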
Example 2: 2D Array Indexing
import numpy as np
# Create a 2D NumPy array
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Access elements by row and column index
arr_2d[0, 0] # 1 (first row, first column)
arr_2d[1, 2] # 6 (second row, third column)
arr_2d[2, 1] # 8 (third row, second column)
arr_2d[-1, -1] # 9 (last row, last column)
Example 3: 3D Array Indexing
import numpy as np
# Create a 3D NumPy array
arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
# Access elements by depth, row, and column index
arr_3d[0, 0, 0] # 1 (first depth, first row, first column)
arr_3d[0, 1, 1] # 4 (first depth, second row, second column)
arr_3d[1, 0, 1] # 6 (second depth, first row, second column)
arr_3d[1, 1, 0] # 7 (second depth, second row, first column)
Example 1: 1D Array Slicing
import numpy as np
# Create a 1D NumPy array
arr = np.array([10, 20, 30, 40, 50, 60, 70])
# Slicing: arr[start:stop:step]
arr[1:4] # [20, 30, 40] (index 1 to 3)
arr[:3] # [10, 20, 30] (start to index 2)
arr[3:] # [40, 50, 60, 70] (index 3 to end)
arr[::2] # [10, 30, 50, 70] (every 2nd element)
arr[::-1] # [70, 60, 50, 40, 30, 20, 10] (reverse)
Example 2: 2D Array Slicing
import numpy as np
# Create a 2D NumPy array
arr_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
# Slicing rows and columns
arr_2d[0:2, 1:3] # [[2, 3], [6, 7]] (rows 0-1, columns 1-2)
arr_2d[:, 2] # [3, 7, 11] (all rows, column 2)
arr_2d[1, :] # [5, 6, 7, 8] (row 1, all columns)
arr_2d[::2, ::2] # [[1, 3], [9, 11]] (every 2nd row & column)
Example 3: 3D Array Slicing
import numpy as np
# Create a 3D NumPy array
arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])
# Slicing across dimensions
arr_3d[0:2, :, :] # First 2 depths, all rows and columns
arr_3d[:, 0, :] # All depths, first row, all columns: [[1, 2], [5, 6], [9, 10]]
arr_3d[:, :, 1] # All depths, all rows, second column: [[2, 4], [6, 8], [10, 12]]
| Attribute | Description | Example | Result |
|---|---|---|---|
| ndim | Number of dimensions | numpy_ex.ndim | 2 |
| shape | Size in each dimension | numpy_ex.shape | (2, 3) |
| size | Total number of elements | numpy_ex.size | 6 |
| dtype | Data type of elements | numpy_ex.dtype | int64 |
| T | Transpose of the array | numpy_ex.T | [[1, 4], [2, 5], [3, 6]] |
| Operation | Description | Example | Result |
|---|---|---|---|
| + | Element-wise addition | a + b | [6, 8, 10, 12] |
| - | Element-wise subtraction | a - b | [-4, -4, -4, -4] |
| * | Element-wise multiplication | a * b | [5, 12, 21, 32] |
| / | Element-wise division | a / b | [0.2, 0.33, 0.43, 0.5] |
| ** | Element-wise power | a ** 2 | [1, 4, 9, 16] |
| sum() | Sum of all elements | np.sum(a) | 10 |
| mean() | Average value | np.mean(a) | 2.5 |
| min() | Minimum value | np.min(a) | 1 |
| max() | Maximum value | np.max(a) | 4 |
| dot() | Dot product | a.dot(b) | 70 |
| reshape() | Change array shape | a.reshape(2, 2) | [[1, 2], [3, 4]] |
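The operations table assumes a = np.array([1, 2, 3, 4]) and b = np.array([5, 6, 7, 8]); a quick sketch reproducing its results:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

a + b            # array([ 6,  8, 10, 12])
a * b            # array([ 5, 12, 21, 32])
np.sum(a)        # 10
np.mean(a)       # 2.5
a.dot(b)         # 70 (= 1*5 + 2*6 + 3*7 + 4*8)
a.reshape(2, 2)  # array([[1, 2], [3, 4]])
```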
Adds data structures and tools designed to work with table-like data (similar to Series and Data Frames in R)
Provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.
Allows handling missing data
A Series is like a NumPy array but with labels. It is strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc.), including a mix of them.
A Series can be created from a scalar, a list, an ndarray, or a dictionary using pd.Series() (note the capital "S").
Example 1: From a List
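The code for this example seems to be missing; a minimal sketch:

```python
import pandas as pd
# Create a Series from a list; pandas assigns a default integer index 0..n-1
s = pd.Series([10, 20, 30, 40])
s[0]      # 10
s[2]      # 30
s.dtype   # dtype('int64')
```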
Pandas DataFrames are your new best friend. They are like the Excel spreadsheets you may be used to.
DataFrames are really just Series stuck together! Think of a DataFrame as a dictionary of series, with the “keys” being the column labels and the “values” being the series data.
Example 1: Basic DataFrame
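The code for this example appears to have been lost; a minimal sketch using the column names from the selection table below (Name, Courses, Language):

```python
import pandas as pd
# A DataFrame as a dictionary of Series: keys are column labels, values are column data
df = pd.DataFrame({
    "Name": ["Shang", "Yuttey", "Sakada"],
    "Courses": [5, 7, 4],
    "Language": ["Python", "Java", "Python"],
})
df.shape   # (3, 3)
```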
Example 2: DataFrame with Custom Labels
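The code for this example also appears to be missing; a sketch showing the index argument, which replaces the default 0..n-1 row labels (the "s1"/"s2"/"s3" labels here are illustrative):

```python
import pandas as pd
# Custom row labels via the index argument
df = pd.DataFrame(
    {"Name": ["Shang", "Yuttey", "Sakada"], "Courses": [5, 7, 4]},
    index=["s1", "s2", "s3"],
)
df.loc["s2", "Name"]   # 'Yuttey'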
There are several main ways to select data from a DataFrame:
| Method | Description | Example | Output |
|---|---|---|---|
| [] | Select column(s) | df["Name"] | ["Shang", "Yuttey", "Sakada"] |
| .loc[] | Label-based indexing | df.loc[0, "Name"] | "Shang" |
| .loc[] | Label-based slicing | df.loc[0:1, ["Name", "Courses"]] | Rows 0-1, Name & Courses columns |
| .iloc[] | Integer position-based | df.iloc[0, 0] | "Shang" |
| .iloc[] | Integer position slicing | df.iloc[0:2, 0:2] | First 2 rows, first 2 columns |
| Boolean | Condition-based filtering | df[df["Courses"] > 5] | Rows where Courses > 5 |
| .query() | SQL-like string query | df.query("Language == 'Python'") | Rows with Python language |
| Method | Syntax | Output |
|---|---|---|
| Select column | df[col_label] | Series |
| Select row slice | df[row_1_int:row_2_int] | DataFrame |
| Select row/column by label | df.loc[row_label(s), col_label(s)] | Scalar for single selection, Series for one row/column, otherwise DataFrame |
| Select row/column by integer | df.iloc[row_int(s), col_int(s)] | Scalar for single selection, Series for one row/column, otherwise DataFrame |
| Select by row integer & column label | df.loc[df.index[row_int], col_label] | Scalar for single selection, Series for one row/column, otherwise DataFrame |
| Select by row label & column integer | df.loc[row_label, df.columns[col_int]] | Scalar for single selection, Series for one row/column, otherwise DataFrame |
| Select by boolean | df[bool_vec] | DataFrame |
| Select by boolean expression | df.query("expression") | DataFrame |
Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won’t actually be creating our own data by hand. Instead, we’ll be working with data that already exists.
Common Methods:
pd.read_csv() - Load CSV file
pd.read_excel() - Load Excel file
To load your data from various file types such as CSV, Excel, or JSON, first define your file path:
import pandas as pd
# Define the file path (a CSV hosted on GitHub)
file_path = "https://raw.githubusercontent.com/MorkMongkul/AI-Bootcamp-Instinct/main/Data/Titanic-Dataset.csv"
# Load the CSV file
df = pd.read_csv(file_path)
# For Excel or JSON data, use the reader that matches the file type:
# df = pd.read_excel("path/to/data.xlsx")
# df = pd.read_json("path/to/data.json")
Note
pd.read_csv() loads CSV files, pd.read_excel() loads Excel files, and pd.read_json() loads JSON files; each reader expects a path (or URL) to a file of that type.
After loading your data, use these methods to inspect and understand your dataset:
| Method | Description | Example |
|---|---|---|
| df.head(n) | View first n rows (default: 5) | df.head(20) → First 20 rows |
| df.tail(n) | View last n rows (default: 5) | df.tail(10) → Last 10 rows |
| df.shape | Get dimensions (rows, columns) | df.shape → (891, 12) |
| df.columns | Get all column names | df.columns → Index of column names |
| df.info() | Get column info, dtypes, missing values | Shows full data overview |
| df.describe() | Get statistical summary | Mean, std, min, max, quartiles |
Recommendation
Run df.info() immediately after loading your data. It provides a comprehensive view of column names, data types, and missing values, giving you a quick understanding of your data and any issues to handle.
Use the Titanic df loaded earlier to select rows and columns in different ways:
1) Select column(s) by name
2) Select rows by index (iloc – Integer Location)
3) Select rows by label (loc – Label Location)
4) Boolean indexing
5) Query syntax
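The five approaches can be sketched as follows. To keep the sketch self-contained it builds a small stand-in DataFrame with real Titanic column names; in practice you would use the df loaded from the CSV above:

```python
import pandas as pd

# Stand-in rows with Titanic columns (replace with the loaded Titanic df)
df = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Pclass": [3, 1, 3],
})

# 1) Select column(s) by name
df["Age"]
df[["Name", "Age"]]

# 2) Select rows by integer position (iloc)
df.iloc[0:2]       # first two rows
df.iloc[0, 1]      # 'male' (row 0, column 1)

# 3) Select rows by label (loc)
df.loc[0:1, ["Name", "Age"]]

# 4) Boolean indexing (wrap each condition in parentheses)
df[(df["Sex"] == "female") & (df["Age"] > 30)]

# 5) Query syntax
df.query("Pclass == 3 and Age < 25")
```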
Tip
When combining multiple conditions, wrap each condition in parentheses: (cond1) & (cond2) or (cond1) | (cond2).
| Method | Used For |
|---|---|
| df["col"] | Select one column |
| df[["c1", "c2"]] | Select multiple columns |
| iloc | Position-based selection |
| loc | Label & condition-based selection |
| Boolean indexing | Filtering rows |
| query() | Readable conditions |
| head() / tail() | Inspection |
| sample() | Random sampling |
| isna() | Missing value filtering |
| select_dtypes() | Feature selection |
Use the Titanic df to compute descriptive statistics and aggregations.
Descriptive Statistics
GroupBy Aggregations
Note
groupby() is powerful for aggregations like mean(), sum(), count(), min(), and max().
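A small self-contained sketch of both ideas, using made-up Pclass/Fare values in place of the real Titanic df:

```python
import pandas as pd

# Stand-in data (use the loaded Titanic df in practice)
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Descriptive statistics on one column
df["Fare"].mean()                  # overall average fare

# GroupBy aggregation: mean fare per passenger class
df.groupby("Pclass")["Fare"].mean()

# Several aggregations at once
df.groupby("Pclass")["Fare"].agg(["mean", "count", "max"])
```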
Detect, remove, or impute missing data to prepare for analysis.
Best Practice
Prefer imputation over dropping rows to preserve data. Choose appropriate statistics (median for skewed data, mean for normal distributions).
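A sketch of the detect/impute/remove workflow on a tiny made-up frame (column names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# Detect: count missing values per column
df.isna().sum()                # Age: 2, Fare: 1

# Impute: fill Age with its median (robust to skewed data)
df["Age"] = df["Age"].fillna(df["Age"].median())

# Remove: drop any rows that still contain missing values
df_clean = df.dropna()
```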
Convert data types and encode categorical variables.
For Machine Learning
Use pd.get_dummies() for one-hot encoding, or scikit-learn's encoders (e.g. OneHotEncoder, LabelEncoder) when you need the encoding to live inside an ML pipeline.
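A sketch of both steps with pandas (the columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female"], "Age": ["22", "38", "26"]})

# Convert a string column to a numeric dtype
df["Age"] = df["Age"].astype(int)

# One-hot encode a categorical column
encoded = pd.get_dummies(df, columns=["Sex"])
encoded.columns.tolist()   # ['Age', 'Sex_female', 'Sex_male']
```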
Identify and remove duplicate rows from your dataset.
Advanced Options
- subset=["col1", "col2"] to check duplicates based on specific columns
- keep="last" or keep=False to control which duplicates to keep
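A minimal sketch covering detection, removal, and the options above:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bob", "Ann"], "Score": [90, 85, 90]})

# duplicated() flags repeats of earlier rows
df.duplicated().tolist()        # [False, False, True]

# Drop fully duplicated rows (keeps the first occurrence by default)
df_unique = df.drop_duplicates()

# Check duplicates on specific columns only, keeping the last occurrence
df_last = df.drop_duplicates(subset=["Name"], keep="last")
```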
A Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats
A set of functionalities similar to those of MATLAB
Line plots, scatter plots, bar charts, histograms, pie charts, etc.
Relatively low-level; some effort is needed to create advanced visualizations
Link: https://matplotlib.org/
Installation:
Import with Alias:
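The commands behind these two headings were presumably the standard ones:

```python
# Installation (run once, in a terminal):
#   pip install matplotlib
# Import with the conventional alias:
import matplotlib.pyplot as plt
```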
| Component | Code Example |
|---|---|
| Figure | plt.figure(figsize=(width, height)) |
| Plot | plt.plot(x, y, marker='o', label="label1"); plt.plot(x, y2, marker='s', label="label2") |
| Labels | plt.xlabel("X-axis Label"); plt.ylabel("Y-axis Label"); plt.title("Plot Title") |
| Ticks | plt.xticks([x_values]); plt.yticks([y_values]) |
| Legend | plt.legend() |
| Gridlines | plt.grid(True) |
| Display | plt.show() |
Example:
import matplotlib.pyplot as plt
# Sample data
exams = [1, 2, 3, 4, 5]
math_scores = [60, 65, 70, 78, 85]
science_scores = [58, 63, 68, 75, 80]
plt.figure(figsize=(7, 5))
plt.plot(exams, math_scores, marker='o', label="Math")
plt.plot(exams, science_scores, marker='s', label="Science")
plt.xlabel("Exam Number")
plt.ylabel("Score")
plt.title("Student Exam Scores")
plt.xticks(exams)
plt.yticks([50, 60, 70, 80, 90])
plt.legend()
plt.grid(True)
plt.show()
| Component | Code Example |
|---|---|
| Figure & Axes | fig, ax = plt.subplots() |
| Plot | ax.plot(x, y, label='label1', marker='o'); ax.plot(x, y2, label='label2', marker='s') |
| Labels | ax.set_xlabel('X-axis Label'); ax.set_ylabel('Y-axis Label'); ax.set_title('Plot Title') |
| Ticks | ax.set_xticks([x_values]); ax.set_yticks([y_values]) |
| Legend | ax.legend() |
| Gridlines | ax.grid(True, linestyle='--', alpha=0.7) |
| Display | plt.show() |
Example:
import matplotlib.pyplot as plt
# Sample data
exams = [1, 2, 3, 4, 5]
math_scores = [60, 65, 70, 78, 85]
science_scores = [58, 63, 68, 75, 80]
# Create figure and axes
fig, ax = plt.subplots()
# Plot lines
ax.plot(exams, math_scores, label='Math Scores', marker='o')
ax.plot(exams, science_scores, label='Science Scores', marker='s')
# Add labels
ax.set_xlabel('Exam Number') # X-axis label
ax.set_ylabel('Scores') # Y-axis label
ax.set_title('Student Performance') # Title
# Customize ticks
ax.set_xticks(exams) # X-tick positions
ax.set_yticks([50, 60, 70, 80, 90]) # Y-tick positions
# Add legend and gridlines
ax.legend() # Legend
ax.grid(True, linestyle='--', alpha=0.7) # Gridlines
plt.show()
Different plot types are suited for different data types and analysis goals:
| Plot Type | Best For | Data Type |
|---|---|---|
| Line Plot | Trends over time/continuous data | Time series, continuous variables |
| Scatter Plot | Relationships between variables | Two continuous variables |
| Bar Chart | Comparing categories | Categorical vs numerical |
| Histogram | Distribution of single variable | Single continuous variable |
| Pie Chart | Parts of a whole (proportions) | Categorical data (percentages) |
| Box Plot | Distribution & outliers | Continuous data across categories |
| Heatmap | Correlation or matrix data | 2D array/matrix data |
Use Case: Show trends over time or continuous relationships
When to Use: Time series data, tracking changes, showing trends
import matplotlib.pyplot as plt
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = [15000, 18000, 16500, 21000, 23500, 25000]
plt.figure(figsize=(8, 5))
plt.plot(months, sales,
marker='o',
linewidth=2,
color='blue',
label='Sales')
plt.xlabel('Month')
plt.ylabel('Sales ($)')
plt.title('Monthly Sales Trend')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()
Use Case: Examine relationships between two continuous variables
When to Use: Correlation analysis, finding patterns, identifying clusters
import matplotlib.pyplot as plt
import numpy as np
# Sample data: study hours vs exam scores
study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [50, 55, 60, 65, 75, 80, 85, 90]
plt.figure(figsize=(8, 5))
plt.scatter(study_hours, exam_scores,
s=100, color='green',
alpha=0.6, edgecolors='black')
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.title('Study Hours vs Exam Score')
plt.grid(True, alpha=0.3)
plt.show()
Use Case: Compare values across different categories
When to Use: Category comparisons, survey results, rankings
import matplotlib.pyplot as plt
# Sample data: product sales by category
categories = ['Electronics', 'Clothing',
'Food', 'Books', 'Toys']
sales = [45000, 32000, 28000, 18000, 15000]
plt.figure(figsize=(8, 5))
plt.bar(categories, sales,
color=['#FF6B6B', '#4ECDC4',
'#45B7D1', '#FFA07A', '#98D8C8'])
plt.xlabel('Product Category')
plt.ylabel('Sales ($)')
plt.title('Sales by Product Category')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Use Case: Show distribution and frequency of a single variable
When to Use: Understanding data distribution, identifying skewness, finding ranges
import matplotlib.pyplot as plt
import numpy as np
# Sample data: student ages in a class
ages = np.random.normal(20, 2, 100)
plt.figure(figsize=(8, 5))
plt.hist(ages, bins=15, color='purple',
alpha=0.7, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Student Ages')
plt.grid(True, alpha=0.3, axis='y')
plt.show()
Use Case: Show proportions and percentages of a whole
When to Use: Market share, budget allocation, composition analysis
import matplotlib.pyplot as plt
# Sample data: budget allocation
categories = ['Marketing', 'R&D',
'Operations', 'HR', 'IT']
budget = [30, 25, 20, 15, 10]
colors = ['#FF9999', '#66B2FF', '#99FF99',
'#FFCC99', '#FF99CC']
plt.figure(figsize=(8, 6))
plt.pie(budget, labels=categories,
autopct='%1.1f%%',
startangle=90, colors=colors)
plt.title('Department Budget Allocation')
plt.axis('equal')
plt.show()
Use Case: Show distribution, median, quartiles, and outliers
When to Use: Comparing distributions, identifying outliers, statistical summary
import matplotlib.pyplot as plt
import numpy as np
# Sample data: test scores
class_a = np.random.normal(75, 10, 50)
class_b = np.random.normal(80, 8, 50)
class_c = np.random.normal(70, 12, 50)
data = [class_a, class_b, class_c]
plt.figure(figsize=(8, 5))
plt.boxplot(data,
labels=['Class A', 'Class B',
'Class C'],
patch_artist=True)
plt.ylabel('Test Scores')
plt.title('Test Score Distribution by Class')
plt.grid(True, alpha=0.3, axis='y')
plt.show()
Use Case: Visualize matrix data, correlations, or intensity values
When to Use: Correlation matrices, confusion matrices, time-based patterns
import matplotlib.pyplot as plt
import numpy as np
# Sample data: a random matrix standing in for a correlation matrix
data = np.random.rand(5, 5)
labels = ['Math', 'Science', 'English',
'History', 'Art']
plt.figure(figsize=(8, 6))
plt.imshow(data, cmap='YlOrRd',
aspect='auto')
plt.colorbar(label='Correlation')
plt.xticks(range(5), labels, rotation=45)
plt.yticks(range(5), labels)
plt.title('Subject Correlation Heatmap')
plt.tight_layout()
plt.show()
Use Case: Display multiple plots side by side for comparison
import matplotlib.pyplot as plt
import numpy as np
# Sample data
x = np.linspace(0, 10, 100)
fig, axes = plt.subplots(2, 2,
figsize=(10, 8))
# Plot 1: Line
axes[0, 0].plot(x, np.sin(x), 'b-')
axes[0, 0].set_title('Sine Wave')
# Plot 2: Scatter
axes[0, 1].scatter(x, np.cos(x),
c='red', alpha=0.5)
axes[0, 1].set_title('Cosine Scatter')
# Plot 3: Bar
axes[1, 0].bar(['A', 'B', 'C'], [3, 7, 5])
axes[1, 0].set_title('Bar Chart')
# Plot 4: Histogram
axes[1, 1].hist(np.random.randn(1000),
bins=30, color='green',
alpha=0.7)
axes[1, 1].set_title('Histogram')
plt.tight_layout()
plt.show()
Built on top of Matplotlib with a high-level interface for drawing attractive statistical graphics
Provides beautiful default styles and color palettes
Designed to work seamlessly with pandas DataFrames
Specialized for statistical visualizations with less code
Installation:
Import with Alias:
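As with Matplotlib, the commands behind these headings were presumably:

```python
# Installation (run once, in a terminal):
#   pip install seaborn
# Import with the conventional alias:
import seaborn as sns
```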
Note
Seaborn is built on Matplotlib, so you’ll often use both libraries together. Seaborn for high-level plotting and Matplotlib for fine-tuning.
| Feature | Matplotlib | Seaborn |
|---|---|---|
| Level | Low-level, more control | High-level, simpler syntax |
| Default Style | Basic, requires customization | Beautiful out-of-the-box |
| Statistical Plots | Requires manual calculation | Built-in statistical functions |
| Pandas Integration | Manual data preparation | Direct DataFrame support |
| Code Length | More verbose | More concise |
| Use Case | Full customization needed | Quick statistical visualization |
Best Practice
Use Seaborn for initial exploration and statistical plots, then switch to Matplotlib when you need fine-grained control.
Seaborn organizes plots into categories based on their purpose:
| Category | Purpose | Key Functions |
|---|---|---|
| Relational | Relationships between variables | scatterplot(), lineplot() |
| Distributional | Distribution of variables | histplot(), kdeplot(), boxplot() |
| Categorical | Categorical comparisons | barplot(), countplot(), boxplot() |
| Regression | Statistical relationships | regplot(), lmplot() |
| Matrix | Matrix data visualization | heatmap(), clustermap() |
Seaborn provides built-in themes to quickly change plot appearance:
import seaborn as sns
import matplotlib.pyplot as plt
# Available styles:
# 'darkgrid', 'whitegrid', 'dark',
# 'white', 'ticks'
# Set style
sns.set_style("whitegrid")
# Set context for scaling
# 'paper', 'notebook', 'talk', 'poster'
sns.set_context("talk")
# Set color palette
sns.set_palette("husl")
Style Examples:
- darkgrid: Dark background with grid
- whitegrid: White background with grid
- dark: Dark background, no grid
- white: White background, no grid
- ticks: White with ticks on axes
Context Examples:
- paper: Smallest (for papers)
- notebook: Default size
- talk: Larger (for presentations)
- poster: Largest (for posters)
import seaborn as sns
import matplotlib.pyplot as plt
# Qualitative palettes (categorical)
sns.color_palette("Set2")
sns.color_palette("Paired")
# Sequential palettes (continuous)
sns.color_palette("Blues")
sns.color_palette("rocket")
# Diverging palettes (two extremes)
sns.color_palette("coolwarm")
sns.color_palette("vlag")
# Custom palette
custom = ["#FF6B6B", "#4ECDC4", "#45B7D1"]
sns.set_palette(custom)
Use Case: Visualize distribution of a single variable with histogram
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Sample data
np.random.seed(42)
data = np.random.normal(100, 15, 1000)
# Create distribution plot
plt.figure(figsize=(10, 6))
sns.histplot(data, kde=True,
color='skyblue', bins=30)
plt.title('Distribution Plot with KDE')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
KDE (Kernel Density Estimate)
Shows smooth probability density curve overlaid on histogram.
Use Case: Compare distributions across categories, identify outliers
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Sample data
df = pd.DataFrame({
'Category': ['A']*50 + ['B']*50 + ['C']*50,
'Values': np.concatenate([
np.random.normal(20, 5, 50),
np.random.normal(30, 7, 50),
np.random.normal(25, 4, 50)
])
})
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Category',
y='Values', palette='Set2')
plt.title('Box Plot Comparison')
plt.show()
Use Case: Show distribution shape with more detail than box plot
Violin vs Box Plot
Violin plots show the full distribution shape (density), while box plots show quartiles and outliers.
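No code accompanies this example; a sketch using the same sample data as the box-plot example above, swapping in seaborn's violinplot:

```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Sample data (same shape as the box-plot example)
np.random.seed(42)
df = pd.DataFrame({
    'Category': np.repeat(['A', 'B', 'C'], 50),
    'Values': np.concatenate([
        np.random.normal(20, 5, 50),
        np.random.normal(30, 7, 50),
        np.random.normal(25, 4, 50),
    ]),
})

plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='Category', y='Values')
plt.title('Violin Plot Comparison')
plt.show()
```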
Use Case: Compare means across categories with confidence intervals
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Sample data: raw observations per category, so seaborn can compute the error bars
np.random.seed(42)
means, stds = [23, 45, 32, 38, 41], [3, 5, 4, 3, 6]
df = pd.DataFrame({
    'Category': np.repeat(['A', 'B', 'C', 'D', 'E'], 30),
    'Value': np.concatenate([np.random.normal(m, s, 30) for m, s in zip(means, stds)])
})
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='Category', y='Value',
            palette='viridis', errorbar='sd')
plt.title('Bar Plot with Error Bars')
plt.ylabel('Average Value')
plt.show()
Use Case: Show frequency of categorical variables
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample data
df = pd.DataFrame({
'Category': ['A']*45 + ['B']*32 +
['C']*28 + ['D']*15 + ['E']*23
})
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Category',
palette='pastel')
plt.title('Count Plot - Frequency Distribution')
plt.ylabel('Count')
plt.show()
Use Case: Show relationship between two variables with trend line
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Sample data
np.random.seed(42)
x = np.random.rand(100) * 100
y = 2 * x + np.random.randn(100) * 15
plt.figure(figsize=(10, 6))
sns.regplot(x=x, y=y,
scatter_kws={'alpha':0.5},
line_kws={'color':'red'})
plt.title('Scatter Plot with Regression Line')
plt.xlabel('X Variable')
plt.ylabel('Y Variable')
plt.show()
Use Case: Visualize correlation matrix or 2D data
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Sample data: a random matrix standing in for a correlation matrix
np.random.seed(42)
data = np.random.rand(5, 5)
labels = ['A', 'B', 'C', 'D', 'E']
plt.figure(figsize=(10, 8))
sns.heatmap(data, annot=True, fmt='.2f',
cmap='coolwarm',
xticklabels=labels,
yticklabels=labels,
cbar_kws={'label': 'Correlation'})
plt.title('Correlation Heatmap')
plt.show()
Use Case: Explore relationships between all variable pairs
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load iris dataset
iris = load_iris(as_frame=True)
df = iris.frame
df['species'] = df['target'].map({
0: 'setosa',
1: 'versicolor',
2: 'virginica'
})
# Create pair plot
sns.pairplot(df, hue='species',
palette='Set1')
plt.show()
Pair Plot
Shows scatter plots for all variable combinations and distributions on diagonal.
Use Case: Create multiple plots based on categorical variables
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Sample data
np.random.seed(42)
df = pd.DataFrame({
'x': np.random.rand(300) * 100,
'y': np.random.rand(300) * 100,
'category': np.repeat(['A', 'B', 'C'], 100)
})
# Create FacetGrid
g = sns.FacetGrid(df, col='category',
height=4)
g.map(sns.scatterplot, 'x', 'y')
g.add_legend()
plt.show()
Use Case: Combine scatter plot with marginal distributions
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Sample data
np.random.seed(42)
x = np.random.randn(500)
y = x + np.random.randn(500) * 0.5
# Create joint plot
sns.jointplot(x=x, y=y, kind='scatter',
marginal_kws={'bins': 30,
'color': 'skyblue'},
joint_kws={'alpha': 0.5})
plt.show()
Joint Plot Types
- scatter: Scatter plot (default)
- hex: Hexbin plot
- kde: KDE contours
- reg: With regression line
| Practice | Recommendation |
|---|---|
| Style | Set style once at the beginning: sns.set_style("whitegrid") |
| Context | Use sns.set_context("talk") for presentations |
| Color Palette | Choose an appropriate palette for the data type (categorical/sequential/diverging) |
| Figure Size | Set before plotting: plt.figure(figsize=(10, 6)) |
| Data Format | Use pandas DataFrames for easy integration |
| Annotations | Add annot=True to heatmaps to show values |
| Statistical Info | Use the errorbar parameter (which replaces the older ci) in barplot for confidence intervals |
| Combining Plots | Use FacetGrid or pairplot for multi-dimensional data |
| Customization | Combine with Matplotlib for fine-tuning |
| Documentation | Check seaborn.pydata.org for examples |
ICT center
Phally Makara
Python for Data Science | Machine Learning